
@kliuae-amd commented Sep 26, 2025

Purpose

This PR targets the Qwen2.5-VL models, unifying the rotary embedding of query and key in the ViT into one kernel adapted from flash_attn. The fused kernel speeds up the ViT's rotary embedding by 1.9x compared with computing the q and k embeddings separately.
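As context for reviewers, the math being fused is the usual rotate-half rotary applied to q and k in one pass. Below is a minimal pure-PyTorch sketch of the idea only, not the flash_attn-derived kernel this PR adds; `apply_rotary_emb_fused` and its stacked-tensor layout are illustrative assumptions:

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Split the head dim in half and rotate: (x1, x2) -> (-x2, x1).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_emb_fused(q, k, cos, sin):
    # Illustrative fusion: stacking q and k lets one elementwise pass
    # cover both tensors, which is what the fused kernel achieves with
    # a single launch instead of two. cos/sin broadcast over the stack
    # and head dims.
    qk = torch.stack((q, k))                    # [2, seq, heads, head_dim]
    out = qk * cos + rotate_half(qk) * sin
    return out[0], out[1]

# Shapes matching the benchmark config below.
q = torch.randn(16384, 16, 80)
k = torch.randn(16384, 16, 80)
freqs = torch.outer(torch.arange(16384.0), torch.rand(40))
cos = freqs.cos().repeat(1, 2).unsqueeze(1)     # [seq, 1, head_dim]
sin = freqs.sin().repeat(1, 2).unsqueeze(1)
q_rot, k_rot = apply_rotary_emb_fused(q, k, cos, sin)
```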

Kernel benchmark

```bash
python benchmarks/kernels/benchmark_vision_rotary_emb.py \
  --batch-size 2 --seq-len 16384 --num-heads 16 --head-size 80 \
  --dtype bfloat16 --device cuda --warmup-iter 10 --benchmark-iter 1000
```

Config: batch=2, seqlen=16384, heads=16, head_dim=80, dtype=torch.bfloat16

| Kernel | Mean runtime (ms) | Median runtime (ms) |
| --- | --- | --- |
| 1c, separated q and k | 0.6679 | 0.6590 |
| 2c, fused q and k | 0.3451 | 0.3416 |

Fusion speedup: 1.936x
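For anyone re-measuring without the script, the `--warmup-iter`/`--benchmark-iter` pattern above can be reproduced with a CUDA-event timing loop; `time_kernel` below is a hypothetical helper, not part of the benchmark script:

```python
import torch

def time_kernel(fn, warmup: int = 10, iters: int = 1000):
    # Hypothetical helper: time a CUDA callable with CUDA events,
    # mirroring the warmup/benchmark-iter flags of the script above.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    times_ms = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()
        times_ms.append(start.elapsed_time(end))  # milliseconds
    t = torch.tensor(times_ms)
    return t.mean().item(), t.median().item()

# e.g. mean_ms, median_ms = time_kernel(lambda: apply_rotary_emb_fused(q, k, cos, sin))
```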

Test Plan

Qwen2.5-VL-72B-Instruct correctness validation on MathVista

Server command:

```bash
VLLM_USE_TRITON_FLASH_ATTN=0 \
VLLM_USE_V1=1 \
VLLM_ROCM_USE_AITER=1 \
SAFETENSORS_FAST_GPU=1 \
VLLM_DISABLE_COMPILE_CACHE=1 \
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
    --tensor_parallel_size 2 \
    --trust_remote_code \
    --no_enable_prefix_caching \
    --enable_multimodal_encoder_data_parallel \
    --disable-mm-preprocessor-cache
```
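Before kicking off the eval, a quick smoke test against the OpenAI-compatible endpoint can confirm the server is up; the port is taken from the eval command below, and the prompt is illustrative:

```python
# Minimal smoke test; assumes the `openai` Python client is installed.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8088/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
)
print(resp.choices[0].message.content)
```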

Test Result

Eval

Command:
```bash
python3 -m eval.run eval_vllm \
        --model_name Qwen/Qwen2.5-VL-72B-Instruct \
        --url http://0.0.0.0:8088 \
        --output_dir temp_mistral_eval \
        --eval_name "mathvista"
```

Separated qk (before):

```json
{
    "explicit_prompt_relaxed_correctness": 0.743,
    "anywhere_in_answer_relaxed_correctness": 0.78
}
```

Fused qk (after):

```json
{
    "explicit_prompt_relaxed_correctness": 0.749,
    "anywhere_in_answer_relaxed_correctness": 0.787
}
```

Benchmark

Command:
```bash
vllm bench serve \
  --backend openai-chat \
  --endpoint-type openai-chat \
  --model Qwen/Qwen2.5-VL-72B-Instruct \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --hf-split train \
  --endpoint /v1/chat/completions \
  --num-prompts 1000 \
  --max-concurrency 64
```

| Metric | Separated qk | Fused qk |
| --- | --- | --- |
| Request throughput (req/s) | 1.71 | 1.76 |
| Output token throughput (tok/s) | 199.55 | 205.61 |
| Total token throughput (tok/s) | 360.85 | 371.16 |
| Mean TTFT (ms) | 3250.92 | 3556.19 |
| Median TTFT (ms) | 1749.78 | 1784.01 |
| P99 TTFT (ms) | 23095.19 | 23362.85 |
| Mean TPOT (ms) | 304.71 | 293.01 |
| Median TPOT (ms) | 300.05 | 289.61 |
| P99 TPOT (ms) | 496.55 | 443.96 |
| Mean ITL (ms) | 297.91 | 284.26 |
| Median ITL (ms) | 46.29 | 45.83 |
| P99 ITL (ms) | 2721.27 | 3096.04 |

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test commands.
  • The test results, such as pasting a before/after results comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: kliuae <[email protected]>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@tjtanaavllm

LGTM

@tjtanaavllm tjtanaavllm merged commit bf15fd3 into ROCm:llama_fp8_03122025 Sep 30, 2025
1 check passed